Machine learning in automated text categorization

Fabrizio Sebastiani 

March 2002 ACM Computing Surveys (CSUR), volume 34 issue 1

Full text available: pdf(524.41 KB) Additional Information: full citation, abstract, references, citings, index terms

The automated categorization (or classification) of texts into predefined categories has
witnessed a booming interest in the last 10 years, due to the increased availability of 
documents in digital form and the ensuing need to organize them. In the research community 
the dominant approach to this problem is based on machine learning techniques: a general 
inductive process automatically builds a classifier by learning, from a set of preclassified 
documents, the characteristics of the categories. ... 

Keywords: Machine learning, text categorization, text classification 
A pragmatic architecture for voice dialog machines aimed at the equipment repair problem
has been implemented. This architecture exhibits a number of behaviors required for efficient
human-machine dialog. These behaviors include: (1) problem solving to achieve a target goal,
(2) the ability to carry out subdialogs to achieve appropriate subgoals and to pass control
arbitrarily from one subdialog to another, (3) the use of a user model to enable useful verbal
exchanges and to inhibit unnecessary ones, ( ...
Ronald Fagin

May 1998 Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium 
Technical papers: dynamic program analysis: Semantic anomaly detection in online
Orna Raz, Philip Koopman, Mary Shaw

May 2002 Proceedings of the 24th international conference on Software engineering 

Full text available: pdf(1.45 MB) Additional Information: full citation, abstract, references, citings, index terms

Much of the software we use for everyday purposes incorporates elements developed and
maintained by someone other than the developer. These elements include not only code and
databases but also dynamic data feeds from online data sources. Although everyday software
is not mission critical, it must be dependable enough for practical use. This is limited by the
dependability of the incorporated elements. It is particularly difficult to evaluate the
dependability of dynamic data feeds, because they ... 
Query evaluation techniques for large databases
Goetz Graefe 

June 1993 ACM Computing Surveys (CSUR), volume 25 issue 2 

Full text available: pdf(9.37 MB) Additional Information: full citation, abstract, references, citings, index terms, review



Database management systems will continue to manage large data volumes. Thus, efficient
algorithms for accessing and manipulating large sets and sequences will be required to 
provide acceptable performance. The advent of object-oriented and extensible database 
systems will not solve this problem. On the contrary, modern data models exacerbate the 
problem: In order to manipulate large sets of complex objects as efficiently as today's 
database systems manipulate simple records, query-processi ...

Keywords: complex query evaluation plans, dynamic query evaluation plans, extensible 
database systems, iterators, object-oriented database systems, operator model of 
parallelization, parallel algorithms, relational database systems, set-matching algorithms, 
sort-hash duality 



2 WSQ/DSQ: a practical approach for combined querying of databases and the Web
Roy Goldman, Jennifer Widom 

May 2000 ACM SIGMOD Record, Proceedings of the 2000 ACM SIGMOD international
conference on Management of data, volume 29 issue 2

Full text available: pdf(223.65 KB) Additional Information: full citation, abstract, references, citings, index terms

We present WSQ/DSQ (pronounced "wisk-disk"), a new approach for combining the query 
facilities of traditional databases with existing search engines on the Web. WSQ, for Web- 
Supported (Database) Queries, leverages results from Web searches to enhance SQL 
queries over a relational database. DSQ, for Database-Supported (Web) Queries, uses 
information stored in the database to enhance and explain Web searches. This paper focuses 
primarily on WSQ, describing a simple, lo ... 

3 Minimum cuts in near-linear time

David R. Karger 

January 2000 Journal of the ACM (JACM), volume 47 issue 1

Full text available: pdf(216.17 KB) Additional Information: full citation, abstract, references, citings, index terms

We significantly improve known time bounds for solving the minimum cut problem on
undirected graphs. We use a "semiduality" between minimum cuts and maximum spanning
tree packings combined with our previously developed random sampling techniques. We
give a randomized (Monte Carlo) algorithm that finds a minimum cut in an m-edge, n-vertex
graph with high probability in O(m log^3 n) time. We also give a ...

Keywords: Monte Carlo algorithm, connectivity, min-cut, optimization, tree packing 



Top-k selection queries over relational databases: Mapping strategies and
performance evaluation 

Nicolas Bruno, Surajit Chaudhuri, Luis Gravano 

June 2002 ACM Transactions on Database Systems (TODS), volume 27 issue 2 

Full text available: pdf(1.64 MB) Additional Information: full citation, abstract, references, citings, index terms

In many applications, users specify target values for certain attributes, without requiring 
exact matches to these values in return. Instead, the result to such queries is typically a 
rank of the "top k" tuples that best match the given attribute values. In this paper, we study 
the advantages and limitations of processing a top-k query by translating it into a single
range query that a traditional relational database management system (RDBMS) can 
process efficiently. In particular, ... 

Keywords: Multidimensional histograms, top-k query processing



Searching in high-dimensional spaces: Index structures for improving the performance
of multimedia databases

Christian Bohm, Stefan Berchtold, Daniel A. Keim 

September 2001 ACM Computing Surveys (CSUR), Volume 33 issue 3 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms

During the last decade, multimedia databases have become increasingly important in many
application areas such as medicine, CAD, geography, and molecular biology. An important 
research issue in the field of multimedia databases is the content-based retrieval of similar 
multimedia objects such as images, text, and videos. However, in contrast to searching data 
in a relational database, a content-based retrieval requires the search of similar objects as a 
basic functionality of the database system ... 

Keywords: Index structures, indexing high-dimensional data, multimedia databases, 
similarity search 



Research sessions: similarity and matching: Continually evaluating similarity-based
pattern queries on a streaming time series
Like Gao, X. Sean Wang 

June 2002 Proceedings of the 2002 ACM SIGMOD international conference on 

Full text available: pdf(1.22 MB) Additional Information: full citation, abstract, references, citings, index terms

In many applications, local or remote sensors send in streams of data, and the system 
needs to monitor the streams to discover relevant events/patterns and deliver instant
reaction correspondingly. An important scenario is that the incoming stream is a continually
appended time series, and the patterns are time series in a database. At each time when a
new value arrives (called a time position), the system needs to find, from the database, the 
nearest or near neighbors of the incoming time serie ... 

Data streams and time-series: Evaluating continuous nearest neighbor queries for 
streaming time series via pre-fetching
Like Gao, Zhengrong Yao, X. Sean Wang 

November 2002 Proceedings of the eleventh international conference on Information 
and knowledge management 

Full text available: pdf(231.86 KB) Additional Information: full citation, abstract, references, citings, index terms

For many applications, it is important to quickly locate the nearest neighbor of a given time
series. When the given time series is a streaming one, nearest neighbors may need to be 
found continuously at all time positions. Such a standing request is called a continuous 
nearest neighbor query. This paper seeks fast evaluation of continuous queries on large 
databases. The initial strategy is to use the result of one evaluation to restrict the search 
space for the next. A more fundamental i ... 

Keywords: continuous query, nearest neighbor, streaming time series 



8 Multidimensional divide-and-conquer 
Jon Louis Bentley 

April 1980 Communications of the ACM, volume 23 issue 4 

Full text available: pdf(1.73 MB) Additional Information: full citation, abstract, references, citings

Most results in the field of algorithm design are single algorithms that solve single problems. 
In this paper we discuss multidimensional divide-and-conquer, an algorithmic paradigm that 
can be instantiated in many different ways to yield a number of algorithms and data 
structures for multidimensional problems. We use this paradigm to give best-known 
solutions to such problems as the ECDF, maxima, range searching, closest pair, and all 
nearest neighbor prob ... 

Keywords: algorithmic paradigms, analysis of algorithms, closest-point problem, 
computational geometry, data structures, empirical cumulative distribution functions, 
maxima problems, multidimensional searching problems, range searching 

Machine learning in automated text categorization
Fabrizio Sebastiani 

March 2002 ACM Computing Surveys (CSUR), volume 34 issue 1

Full text available: pdf(524.41 KB) Additional Information: full citation, abstract, references, citings, index terms

The automated categorization (or classification) of texts into predefined categories has 
witnessed a booming interest in the last 10 years, due to the increased availability of 
documents in digital form and the ensuing need to organize them. In the research 
community the dominant approach to this problem is based on machine learning techniques: 
a general inductive process automatically builds a classifier by learning, from a set of 
preclassified documents, the characteristics of the categories. ... 

Keywords: Machine learning, text categorization, text classification 
Edgar Chavez, Gonzalo Navarro, Ricardo Baeza-Yates, Jose Luis Marroquin
September 2001 ACM Computing Surveys (CSUR), Volume 33 issue 3 

Full text available: pdf(916.04 KB) Additional Information: full citation, abstract, references, citings, index terms

The problem of searching the elements of a set that are close to a given query element
under some similarity criterion has a vast number of applications in many branches of 
computer science, from pattern recognition to textual and multimedia information retrieval. 
We are interested in the rather general case where the similarity criterion defines a metric 
space, instead of the more restricted case of a vector space. Many solutions have been 
proposed in different areas, in many cases without cros ... 

Keywords: Curse of dimensionality, nearest neighbors, similarity searching, vector spaces 



Measuring praise and criticism: Inference of semantic orientation from association
Peter D. Turney, Michael L. Littman 

October 2003 ACM Transactions on Information Systems (TOIS), volume 21 issue 4 
Full text available: pdf(640.81 KB) Additional Information: full citation, abstract, references, index terms

The evaluative character of a word is called its semantic orientation. Positive semantic 
orientation indicates praise (e.g., "honest", "intrepid") and negative semantic orientation 
indicates criticism (e.g., "disturbing", "superfluous"). Semantic orientation varies in both 
direction (positive or negative) and degree (mild to strong). An automated system for 
measuring semantic orientation would have application in text classification, text filtering, 
tracking opinions in online discussions ... 

Keywords: latent semantic analysis, mutual information, semantic association, semantic 
orientation, text classification, text mining, unsupervised learning, web mining 



12 The state of the art in distributed query processing
Donald Kossmann 

December 2000 ACM Computing Surveys (CSUR), volume 32 issue 4 

Full text available: pdf(455.39 KB) Additional Information: full citation, abstract, references, citings, index terms

Distributed data processing is becoming a reality. Businesses want to do it for many 
reasons, and they often must do it in order to stay competitive. While much of the 
infrastructure for distributed data processing is already there (e.g., modern network 
technology), a number of issues make distributed data processing still a complex 
undertaking: (1) distributed systems can become very large, involving thousands of 
heterogeneous sites including PCs and mainframe server machines; (2) the stat ... 

Keywords: caching, client-server databases, database application systems, dissemination- 
based information systems, economic models for query processing, middleware, multitier 
architectures, query execution, query optimization, replication, wrappers 



System R: relational approach to database management

M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P. P. Griffiths, W. 
F. King, R. A. Lorie, P. R. McJones, J. W. Mehl, G. R. Putzolu, I. L. Traiger, B. W. Wade, V. 
Watson 

June 1976 ACM Transactions on Database Systems (TODS), volume 1 issue 2 

Full text available: pdf(3.18 MB) Additional Information: full citation, abstract, references, citings, index terms

System R is a database management system which provides a high level relational data
interface. The system provides a high level of data independence by isolating the end user
as much as possible from underlying storage structures. The system permits definition of a 
variety of relational views on common underlying data. Data control features are provided, 
including authorization, integrity assertions, triggered transactions, a logging and recovery 
subsystem, and facilities for maintaining ... 

Keywords: authorization, data structures, database, index structures, locking, 
nonprocedural language, recovery, relational model 



Technical reports

SIGACT News Staff 

January 1980 ACM SIGACT News, volume 12 issue 1 

Full text available: pdf(5.28 MB) Additional Information: full citation



15 Information retrieval on the web

Mei Kobayashi, Koichi Takeda 

June 2000 ACM Computing Surveys (CSUR), volume 32 issue 2 

Full text available: pdf(213.89 KB) Additional Information: full citation, abstract, references, citings, index terms

In this paper we review studies of the growth of the Internet and technologies that are 
useful for information search and retrieval on the Web. We present data on the Internet
from several different sources, e.g., current as well as projected number of users, hosts, 
and Web sites. Although numerical figures vary, overall trends cited by the sources are 
consistent and point to exponential growth in the past and in the coming decade. Hence it is
not surprising that about 85% of Internet user ... 

Keywords: Internet, World Wide Web, clustering, indexing, Information retrieval, 
knowledge management, search engine 
September 2001 Journal on Educational Resources in Computing (JERIC)

Full text available: pdf(613.63 KB) Additional Information: full citation, references, citings, index terms
Learning search engine specific query transformations for question answering

Eugene Agichtein, Steve Lawrence, Luis Gravano

April 2001 Proceedings of the tenth international conference on World Wide Web

Full text available: pdf(205.68 KB) Additional Information: full citation, references, citings, index terms



Keywords: Information retrieval, query expansion, question answering, web search 



Data structures and algorithms for nearest neighbor search in general metric spaces
Peter N. Yianilos 

January 1993 Proceedings of the fourth annual ACM-SIAM Symposium on Discrete 
algorithms 
Keywords: associative memory, clustering, computational geometry, metric space, nearest
neighbor, pattern recognition, randomized methods 



19 Index-driven similarity search in metric spaces
Gisli R. Hjaltason, Hanan Samet 

December 2003 ACM Transactions on Database Systems (TODS), volume 28 issue 4 
Full text available: pdf(650.64 KB) Additional Information: full citation, abstract, references, index terms

Similarity search is a very important operation in multimedia databases and other database 
applications involving complex objects, and involves finding objects in a data set S similar to 
a query object q, based on some similarity measure. In this article, we focus on methods for 
similarity search that make the general assumption that similarity is represented with a 
distance metric d. Existing methods for handling similarity search in this setting typically fall 
into one of ... 

Keywords: Hierarchical metric data structures, distance-based indexing, nearest neighbor
queries, range queries, ranking, similarity searching 



20 Performance issues: A heuristic approach to attribute partitioning

Michael Hammer, Bahram Niamir 

May 1979 Proceedings of the 1979 ACM SIGMOD international conference on 
Management of data 

Full text available: pdf(1.22 MB) Additional Information: full citation, abstract, references, citings

One technique that is sometimes employed to enhance the performance of a database 
management system is known as attribute partitioning. This is the process of dividing the 
attributes of a file into separately stored subfiles. By storing together those attributes that 
are frequently requested together by transactions, and by separating those that are not, 
attribute partitioning can reduce the number of pages that are transferred from secondary 
storage to primary memory in the processing of a tran ... 